Enable "index file" reads for catalog import #334

delucchi-cmu · 2024-06-11T14:15:21Z

Change Description

Closes #308.

Solution Description

Creates a new kind of file reader for catalog import: the indexed file reader. It uses a single "index" file as the task unit, and each index file contains only the paths of the data files to be read. This lets many small input data files be batched into larger chunks for the map and reduce stages of the pipeline.
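For readers unfamiliar with the idea, here is a minimal, standard-library-only sketch of an index-driven CSV read. It is not the code from this PR: the function names (`read_index_file`, `read_indexed_csv`), the `chunk_rows` parameter, and the one-path-per-line index layout are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_index_file(index_path):
    """Return the data file paths listed in an index file (assumed: one path per line)."""
    lines = Path(index_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def read_indexed_csv(index_path, chunk_rows=1000):
    """Yield lists of row dicts, at most chunk_rows long, spanning all listed CSV files."""
    chunk = []
    for data_path in read_index_file(index_path):
        with open(data_path, newline="") as handle:
            for row in csv.DictReader(handle):
                chunk.append(row)
                if len(chunk) == chunk_rows:
                    yield chunk
                    chunk = []
    # Emit whatever is left after the last data file.
    if chunk:
        yield chunk


# Demo: three tiny CSV files become a single task driven by one index file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    data_paths = []
    for i in range(3):
        data_path = tmp_path / f"part_{i}.csv"
        data_path.write_text(f"id,ra,dec\n{2 * i},10.0,-5.0\n{2 * i + 1},11.0,-6.0\n")
        data_paths.append(str(data_path))
    index_path = tmp_path / "index_0.txt"
    index_path.write_text("\n".join(data_paths) + "\n")

    chunks = list(read_indexed_csv(index_path, chunk_rows=4))

chunk_sizes = [len(chunk) for chunk in chunks]
ids = [row["id"] for chunk in chunks for row in chunk]
```

Because chunks are filled across file boundaries, the chunk size seen by later stages depends on `chunk_rows` rather than on how small the individual input files are.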

Implements indexed readers for CSV and Parquet files. In particular, the Parquet reader uses pyarrow's batch_readahead and fragment_readahead options, together with multi-threading, to further speed up reads of many small data files.

Code Quality

I have read the Contribution Guide and LINCC Frameworks Code of Conduct
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

codecov · 2024-06-11T14:20:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.79%. Comparing base (04c1a8f) to head (c45a6dc).
Report is 40 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #334   +/-   ##
=======================================
  Coverage   99.78%   99.79%           
=======================================
  Files          26       26           
  Lines        1389     1442   +53     
=======================================
+ Hits         1386     1439   +53     
  Misses          3        3

jeremykubica

Two high level questions.

src/hipscat_import/catalog/file_readers.py

delucchi-cmu and others added 5 commits May 24, 2024 11:31

Indexed CSV reads.

ce7a7f9

Indexed parquet reads.

37ed857

implement dataset.to_batches method in IndexedParquetReader

c4b6732

Use file_io dataset read for cloud URIs

db62e1e

Merge branch 'main' into issue/308/indexed

daf153a

delucchi-cmu requested a review from jeremykubica June 11, 2024 14:26

jeremykubica reviewed Jun 11, 2024

src/hipscat_import/catalog/file_readers.py (resolved)

src/hipscat_import/catalog/file_readers.py (resolved)

Add documentation on index batching

c45a6dc

jeremykubica approved these changes Jun 12, 2024

delucchi-cmu merged commit 33d70e7 into main Jun 12, 2024
9 checks passed

delucchi-cmu deleted the issue/308/indexed branch June 12, 2024 17:54

delucchi-cmu mentioned this pull request Jun 13, 2024

Tests for indexed readers astronomy-commons/hats-cloudtests#30

Merged
